Introduction

The objective of this project was to compile data indicative of the data science skills currently in demand. We did this by building tables that combine skill lists scraped from blog posts with result counts from actual job postings.

In doing this, we sought to answer three questions:

  1. How do the skills vary between the different job posting sites?
  2. How do the skills sought in job postings compare to the skills recommended in articles and blogs?
  3. Are there differences among the three categories: Languages, Tech Data, and Soft Skills?

Step 1: Listing skills based on job postings

Our first main step was creating a table populated with 1) skills in demand according to data science blogs and 2) the number of job postings listing each skill.

Performing web scraping to fetch the skills using a node selector

We used the rvest package's node selectors to subset the skills of interest:

# Load rvest for scraping (it also re-exports the %>% pipe)
library(rvest)

# Start by reading an HTML page with read_html():
skills_page <- read_html("https://brainstation.io/career-guides/data-science-resume-examples#what-skills-should-you-put-on-a-data-science-resume")

# Pull the <li> entries from the relevant skill lists
skills_set1 <- skills_page %>% html_nodes("ul:nth-child(4)") %>% html_elements("li") %>% html_text2()
skills_set2 <- skills_page %>% html_nodes("ul:nth-child(6)") %>% html_elements("li") %>% html_text2()

From here, we can preview the vector:

writeLines(skills_set1)
## Data Courses
## Data Analytics
## Data Science
## Python
## Cybersecurity
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
## Research the company, the role, and relevant data skills
## Reference resume templates and samples to build a resume outline
## Add relevant education experience, work experience and data projects to the correct section of your resume
## Highlight experience with machine learning and data tools
## Craft concise bullet points using the action verb + task + result format for each experience, emphasizing data-driven successes
## Have a trusted peer proofread your Data Scientist resume for grammar and spelling to make sure your experience is professionally presented
## Work on projects in a collaborative setting
## Take advantage of our flexible plans and scholarships
## Get access to VIP events and workshops

The next step was trimming extra whitespace from the entries in skills_set1 and skills_set2, keeping only the elements that are actual skills (elements 6-19 of the first list, 3-34 of the second):

# Keep the actual skills from skills_set1 (elements 6-19) and trim whitespace
vector1 = c()
for (i in 6:19) {
  vector1[i-5] <- as.character(trimws(skills_set1[i], which = c("both")))
}

# Keep the actual skills from skills_set2 (elements 3-34) and trim whitespace
vector2 = c()
for (i in 3:34) {
  vector2[i-2] <- as.character(trimws(skills_set2[i], which = c("both")))
}
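Since trimws() is vectorized, each of the loops above can also be collapsed into a single subset-and-trim call; an equivalent sketch with stand-in values:

```r
# trimws() operates elementwise, so subsetting plus one call replaces each loop
raw_skills  <- c("  Data analysis", "Data wrangling  ")  # stand-in for skills_set1[6:19]
vector1_alt <- trimws(raw_skills, which = "both")
vector1_alt
## [1] "Data analysis"  "Data wrangling"
```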

Here are the cleaned vectors:

vector1 %>% writeLines()
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
vector2 %>% writeLines()
## R
## Python
## C
## C++
## C#
## HTML
## Java
## JavaScript
## PHP
## SAS
## SQL
## Scala
## MATLAB
## SQL Server
## NoSQL
## Hadoop
## OpenRefine
## TensorFlow
## Cloudera
## Tableau
## Microsoft Excel
## Octave
## Spark
## PowerBI
## Plotly
## Bokeh
## Matplotlib
## Seaborn
## Keras
## Pytorch
## AWS
## Hive

Soft Skills: We can start by reading an HTML page with read_html():

softskills_page <- read_html("https://zety.com/blog/data-scientist-resume-example")

From here, we can scrape the soft skills using a node selector:

skills_set3 <- softskills_page %>% html_nodes("td:nth-child(1) > p > a") %>% html_text2()

skills_set3[3:9]
## [1] "Soft Skills"          "Communication"        "Collaboration"       
## [4] "Creative Thinking"    "Critical Thinking"    "Problem Solving"     
## [7] "Interpersonal Skills"

As before, we trim these entries into a vector of soft skills:

# Trim elements 3-9 of skills_set3 (element 3 is the "Soft Skills" table header)
vector3 = c()
for (i in 3:9) {
  vector3[i-2] <- as.character(trimws(skills_set3[i], which = c("both")))
}

vector3
## [1] "Soft Skills"          "Communication"        "Collaboration"       
## [4] "Creative Thinking"    "Critical Thinking"    "Problem Solving"     
## [7] "Interpersonal Skills"

We can now combine the technical-skill vectors with the soft-skill vector:

title_vector <- c(vector1, vector2, vector3)
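Because the blog lists can overlap (Python, for instance, appears in the raw scrape of both pages), it can be worth checking the combined vector for duplicates before using it as a key column; a quick sketch with stand-in values:

```r
# Report any skill titles that appear more than once
demo_titles <- c("Python", "R", "SQL", "Python")  # stand-in for title_vector
unique(demo_titles[duplicated(demo_titles)])
## [1] "Python"
```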

Populating the columns with job posting counts

From there, we manually searched multiple job listing sites and populated tables based on the number of results:

# indeed_jobs, linkedin_jobs, simplyhired_jobs, and ziprecruiter_jobs are
# vectors of search-result counts compiled manually, ordered to match title_vector
tbl1.siteskills = data.frame(skill_title = title_vector,
                indeed = indeed_jobs,
                linkedin = linkedin_jobs,
                simplyhired = simplyhired_jobs,
                ziprecruiter = ziprecruiter_jobs
                )
head(tbl1.siteskills) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
skill_title          indeed  linkedin  simplyhired  ziprecruiter
Data analysis        46,915   580,822       27,586     1,685,359
Data wrangling        2,660     3,329        1,233           434
Data modeling        28,328   135,443       13,702       623,648
Statistics            2,327   216,875       44,456       288,970
Data visualization   41,727   119,686       37,409       170,268
Programming           9,444   510,656       87,187     2,047,037

We'll refer to this as Table 1 moving forward. We can export this table to the working directory:

# Get working directory path
path <- getwd()
path
## [1] "/Users/deepikadilip/git/Data_607_Project3"
write.csv(tbl1.siteskills, file.path(path, "skills_cleaned_data.csv"))
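One caveat worth noting: write.csv() writes row names by default, and they come back as an extra unnamed column on re-import. Passing row.names = FALSE avoids this; a small self-contained sketch:

```r
# Round-trip a toy data frame through CSV without the row-name column
df  <- data.frame(skill_title = c("R", "SQL"), indeed = c(10, 20))
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)

# The re-imported column names match the original exactly
identical(names(read.csv(tmp)), names(df))
## [1] TRUE
```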

Step 2: Integration with Existing Tables

From here, we can join this table with existing data we've scraped:

Table 2: Skill Categories

tbl2.skillcat = read.csv('SkillCategories.csv')
head(tbl2.skillcat) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
skill_title          SKILLID  skill_category
Data analysis              1  Fundamentals
Data wrangling             2  Fundamentals
Data modeling              3  Fundamentals
Statistics                 4  Mathematics
Data visualization         5  Fundamentals
Programming                6  Fundamentals

Table 4: Skill Recommendation Websites

tbl4.skillsites = read.csv('RecommendationSites.csv')
head(tbl4.skillsites, 2) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
RecommendationID URL Site Title Year
1 https://www.thebalancecareers.com/list-of-data-scientist-skills-2062381 the balance careers Important Job Skills for Data Scientist 2021
2 https://www.analyticsvidhya.com/blog/2020/11/14-must-have-skills-to-become-a-data-scientist-with-resources/ Analytics Vidhya 14 Must-Have Skills to Become a Data Scientist (with Resources!) 2020
In total, we have the following tables:

  Table 1: skill_title, indeed, linkedin, simplyhired, ziprecruiter, SKILLID
  Table 2: skill_title, SKILLID, skill_category
  Table 3: Skill, RecommendationID, InJobSkills
  Table 4: RecommendationID, URL, Site, Title, Year

Step 3: Answering Questions

Job Demand Distribution

We can start visualization by looking at the job demand distribution in Table 1. To do this, we'll need to pivot the dataset into long format prior to using ggplot:

# Pivot the four site columns into long format (one row per skill x site)
mlt.tbl1 = pivot_longer(tbl1.siteskills, cols = c("indeed", "linkedin", "simplyhired", "ziprecruiter"), values_to = "count")

# Strip thousands separators and convert the counts to integers
mlt.tbl1 = mlt.tbl1 %>% mutate(count = as.integer(gsub(",", "", count)))

# Attach the skill categories from Table 2
mlt.tbl1.merged = mlt.tbl1 %>% merge(tbl2.skillcat, by = c("skill_title", "SKILLID"), all.x = T)

# Capitalize the site names for the facet labels
mlt.tbl1.merged$name = str_to_sentence(mlt.tbl1.merged$name)
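As a quick sanity check on the count cleanup, the gsub() step turns a comma-formatted count from the job sites into a plain integer:

```r
# Strip thousands separators, then convert to integer
clean_count <- as.integer(gsub(",", "", "1,685,359"))
identical(clean_count, 1685359L)
## [1] TRUE
```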

plt.sites = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 1, scales = "free_y") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1),
        legend.position = "bottom")

ggplotly(plt.sites)  # Make the plot interactive

This plot is a bit dense, so we can group the skill categories into three broader "buckets" and visualize those instead:

mlt.tbl1.merged$skill_bucket = ifelse(mlt.tbl1.merged$skill_category == "Languages", "Languages",
                                      ifelse(mlt.tbl1.merged$skill_category == "Soft Skills", "Soft Skills",
                                             "Tech Data"))
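The nested ifelse() can also be written with dplyr's case_when(), which scales better if more buckets are added later; an equivalent sketch (bucket_of is a hypothetical helper name):

```r
library(dplyr)

# Map a skill category to its broader bucket
bucket_of <- function(skill_category) {
  case_when(
    skill_category == "Languages"   ~ "Languages",
    skill_category == "Soft Skills" ~ "Soft Skills",
    TRUE                            ~ "Tech Data"   # everything else
  )
}

bucket_of(c("Languages", "Fundamentals", "Soft Skills"))
## [1] "Languages"   "Tech Data"   "Soft Skills"
```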

plt.sites.aggregated = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket + name, ncol = 2, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1),
        legend.position = "bottom")

ggplotly(plt.sites.aggregated, width = 800, height = 1500)

If we don't want to stratify by job website and instead visualize an overview, it appears as follows:

tbl1.per_bucket = mlt.tbl1.merged %>% group_by(skill_title, skill_bucket) %>% summarise(count = sum(count))

plt.bucket = ggplot(tbl1.per_bucket, aes(skill_title, count, fill = factor(skill_bucket))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket, ncol = 1, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1),
        legend.position = "bottom")

ggplotly(plt.bucket, width = 800, height = 1500)

Finally, we can run some basic statistical tests to see whether particular skills are listed more often on one website than the others. We'll start by running a proportions test and looking at which skills differ the most across websites:

prop.table  %>%  arrange(desc(test.statistic)) %>% head(5)  %>% kable() %>% kable_styling(bootstrap_options = "basic") 
indeed linkedin simplyhired ziprecruiter test.statistic
Communication 0.0060592 0.4471800 0.0040841 0.5426767 13330475
Hypothesis testing 0.0016622 0.0015483 0.0010033 0.9957862 8818391
Programming 0.0035580 0.1923865 0.0328472 0.7712084 5420010
Data analysis 0.0200433 0.2481422 0.0117855 0.7200290 4126554
Problem Solving 0.3125545 0.5271276 0.1508640 0.0094539 2876377

We'll do the same for those that differ the least among websites:

prop.table  %>%  arrange(desc(test.statistic)) %>% tail(5)  %>% kable() %>% kable_styling(bootstrap_options = "basic") 
indeed linkedin simplyhired ziprecruiter test.statistic
Octave 0.1481013 0.7936709 0.0417722 0.0164557 1701.61
Plotly 0.1740716 0.2115385 0.1143899 0.5000000 1417.67
Seaborn 0.2858482 0.3514527 0.2052484 0.1574508 126.02
Bokeh 0.2821369 0.3005008 0.1602671 0.2570952 37.33
OpenRefine 0.3557692 0.1826923 0.3269231 0.1346154 19.38
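Per-skill proportions and a test statistic of this kind can be computed with base R; a minimal sketch using the Data analysis counts from Table 1. A chi-squared goodness-of-fit test is one plausible form of the proportions test here, as the exact test code the team ran isn't shown:

```r
# Per-site counts for "Data analysis", taken from Table 1
counts <- c(indeed = 46915, linkedin = 580822,
            simplyhired = 27586, ziprecruiter = 1685359)

# Site shares: these reproduce the Data analysis row in the proportions table
round(prop.table(counts), 7)

# One plausible form of the test: chi-squared goodness of fit
# against an even split across the four sites
chisq.test(counts)$statistic
```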

Key observations:

  • Soft skills were quite consistent throughout job listings. Communication in particular was emphasized across the board, followed by problem solving.
  • With respect to languages, job descriptions appeared to emphasize querying languages (such as SQL) over programming languages such as Python and R. However, when excluding SQL, both Python and JavaScript appeared in a considerable number of job listings.
  • Software package mentions varied across the board. Seaborn and Plotly had similar distributions across the job sites, but Matplotlib was seen more often on ZipRecruiter.